Scoring, M2 Data Science for Social Sciences

Author

Théo Druilhe, Pierre Larose, Nathan Pizzetta, Sigurd Saue

Context

In this project, the objective is to develop a predictive model for corporate bankruptcy, utilizing historical financial and default data. Companies may encounter various forms of financial distress, such as missed payments, distressed exchanges, or formal bankruptcy proceedings like Chapter 7 and Chapter 11 filings. By analyzing these events and their associated financial indicators, this project aims to build a dataset that captures each company’s fiscal health over time and leverages machine learning to forecast the likelihood of default.

The work involves combining multiple datasets—Compustat, LoPucki, and Moody’s—which each track different aspects of corporate financial distress. After careful preprocessing to align and clean the data, the most recent financial and default indicators for each company are extracted. These serve as the target variable \(Y\) for bankruptcy risk, which will later be matched with a comprehensive feature set \(X\) to train the predictive model. This analysis aims to provide insights into the patterns that precede default and to contribute a robust tool for assessing bankruptcy risk in real-world scenarios.

Building our dataset

In this section we will focus on building our target variable \(Y\) for bankruptcy prediction and our feature set \(X\) for training the predictive model.

Building Y - Bankruptcy Data

To build our target variable we will go through the following steps:

  1. Extracting the relevant data from the Compustat, LoPucki, and Moody’s datasets.
  2. Merging these datasets to create a comprehensive dataset that captures the financial health and default history of each company.
  3. Creating the target variable \(Y\) based on bankruptcy events within 1 year of each fiscal year.

Extracting Data

<<<<<<< HEAD

The comp.company database contains some information for a wide range of companies: address, phone number, website url, …
For our analysis we are interested in the deletion date and the deletion reason which gives us potential information about companies that could have been bankrupted.
To do so, we will extract the company name, the key, the reason for deletion, and the deletion date.

Deletion Reason :
- 2: Bankruptcy
- 3: Liquidation

=======

The comp.company database contains some information for a wide range of companies: address, phone number, website url, …
For our analysis we are interested in the deletion date and the deletion reason which gives us potential information about companies that could have been bankrupted.
To do so, we will extract the company name, the key, the reason for deletion, and the deletion date.

>>>>>>> main

Example of code to connect to WRDS database and extract data from Compustat database

Code
library(tidyverse)
library(dbplyr)
library(RPostgres)

wrds <- dbConnect(
    Postgres(),
    host = "wrds-pgdata.wharton.upenn.edu",
    dbname = "wrds",
    port = 9737,
    sslmode = "require",
    user = Sys.getenv("wrds_user"),
    password = Sys.getenv("wrds_password")
)

compustat_list <- dbListObjects(wrds, Id(schema = "comp"))

# Use dplyr verbs with a remote database table
# https://dbplyr.tidyverse.org/reference/tbl.src_dbi.html
funda_db <- tbl(wrds, in_schema("comp", "funda"))
funda_db %>%
  filter(grepl('APPLE INC', conm)) %>%
  select(gvkey, fyear, conm, at, wcap, re, ebit, lt, sale) %>%
  mutate(WCTA = wcap / at,
         RETA = re / at,
         EBTA = ebit / at,
         TLTA  = lt / at, # as a proxy for ME/TL
         SLTA = sale / at)
Code
library(dplyr)
library(lubridate)
source("../scripts/helper_functions.R")

# Construct the file path in a way that works on both macOS and Windows
file_path <- file.path("..", "data", "wrds_data", "company_all.rds")

# Check if the file exists and read the Excel file
if (file.exists(file_path)) {
  company_db <- readRDS(file_path)
} else {
  print("File not found. Please check the file path.")
}

# Extract relevant company information
compustat_company_data <- company_db %>%
    filter(is.na(dldte) | (year(dldte) >= "2011")) %>% 
    select(gvkey, conm, dlrsn, dldte) %>%
    collect()

styled_dt(compustat_company_data)
<<<<<<< HEAD
=======
>>>>>>> main

This dataset is key because it links the fiscal year with the financial data. To use it porperly we will try to keep all the companies where we can be confident if they were bankrupt or not. The objective of predicting a company bankruptcy on the next year relies on the ongoing fiscal year, it is therefore necessary to have it for all our companies in our dataset.

Code
# Construct the file path in a way that works on both macOS and Windows
file_path <- file.path("..", "data", "wrds_data", "compustat_all.rds")

# Check if the file exists and read the Excel file
if (file.exists(file_path)) {
  compustat_all <- readRDS(file_path)
} else {
  print("File not found. Please check the file path.")
}

# Select relevant financial metrics and compute financial ratios
fiscal_year <- compustat_all %>%
    filter(fyear >= "2010") %>% 
    select(gvkey, fyear)

fy_size <- fiscal_year %>% group_by(gvkey) %>% filter(fyear==max(fyear)) %>% dim

cat("Number of companies within this period of time :", fy_size[1])
Number of companies within this period of time : 21758

Lopucki database includes approximately one thousand large public company bankruptcies filed in the United States Bankruptcy Courts under Chapter 7 or Chapter 11 since October 1, 1979 to December 11, 2022.

  • Chapter 7 : liquidation
  • Chapter 11 : reorganization

We will extract the company name, the key, the filling date, and the chapter.

Code
# Construct the file path in a way that works on both macOS and Windows
file_path <- file.path("..","data", "default_data", "Bankruptcy - LoPucki", "Florida-UCLA-LoPucki Bankruptcy Research Database 1-12-2023.xlsx")

# Check if the file exists and read the Excel file
if (file.exists(file_path)) {
  lopucki_db <- readxl::read_xlsx(file_path)
} else {
  print("File not found. Please check the file path.")
}

# Filter for Chapter 7 (liquidation) and Chapter 11 (reorganization) bankruptcies
lopucki_clean <- lopucki_db %>%
    filter((Chapter %in% c("7", "11")) & (Disposition != "Chapter 7 at filing") & (year(DateFiled) >= "2011")) %>%  # Exclude Chapter 7 at filing as advised in the do file of the dataset
    select(GvkeyBefore, NameCorp, DateFiled, Chapter) %>%
    group_by(GvkeyBefore) %>%
    summarize(DateFiled = min(DateFiled),
              NameCorp = NameCorp[which.min(DateFiled)],
              Chapter = Chapter[which.min(DateFiled)]) %>% 
    mutate(DateFiled = lubridate::as_date(DateFiled)) %>% 
    ungroup()

styled_dt(lopucki_clean)
<<<<<<< HEAD
=======
>>>>>>> main

Moody’s Annual Default Reports for Corporates provide detailed analysis of corporate bond defaults and credit rating changes. These reports track default rates, recovery rates, and rating transitions for companies around the world.

Data was extracted from Moody’s Annual reports covering Corporate Default starting in 2002, it is more heterogeneous than LoPucki and not directly linkable to Compustat since only the name of the company is reported, also depending on the year, either full date, the month or nothing is reported regarding the default/bankruptcy event.

What is interesting for us with this dataset, is that a lot of companies are deleted from comp.company for a different reason than credit (Acquisitions, Other) but are in fact defaulting companies in Moody’s dataset.

Code
# Construct the file path in a way that works on both macOS and Windows
file_path <- file.path("..","data", "default_data", "dat_default_moodys_annual.rds")

# Check if the file exists and read the Excel file
if (file.exists(file_path)) {
  moody_db <- readRDS(file_path)
} else {
  print("File not found. Please check the file path.")
}

moody_clean <- moody_db %>%
    filter(year(default_date) >= "2011") %>% 
    select(company_name, default_type, default_date)

# Inspect the best matches
styled_dt(moody_clean)
<<<<<<< HEAD
=======
>>>>>>> main

Merging Datasets

Compustat and LoPucki

To merge the Compustat and LoPucki datasets, we will use the gvkey (Compustat) and GvkeyBefore (LoPucki) columns as the key for merging.

Code
# Merge Compustat and LoPucki data based on gvkey (Compustat) and GvkeyBefore (LoPucki)
comp_lopucki <- compustat_company_data %>%
    left_join(lopucki_clean, by = c("gvkey" = "GvkeyBefore"))

Comp_lopucki with Moody

To manage to merge the Moody’s dataset with the Compustat dataset, we need to standardize the company names in both datasets. We will remove punctuation, whitespace, and common company suffixes to create a common key for merging.

Code
# Standardize company names in both datasets
trim_pattern <- ",|\\.|INC|LLC|CORPORATION|CORP|COMPANY|\\*|\\s+"

# Clean Moody's data
moody_clean <- moody_clean %>%
    mutate(
        compact_name = stringr::str_to_upper(company_name),
        compact_name = stringr::str_trim(gsub(trim_pattern, "", compact_name))
        )

# Clean Compustat company data
comp_lopucki <- comp_lopucki %>%
    mutate(compact_name = stringr::str_trim(gsub(trim_pattern, "", conm))
           )

comp_lopucki_moody <- comp_lopucki %>%
                    left_join(moody_clean, by = c("compact_name")) %>%
                    filter(!is.na(gvkey)) %>%
                    select(gvkey, conm, dlrsn, dldte, DateFiled, Chapter, default_date, default_type)

Comp_lopucki_moody with Fiscal Year

Here, we will merge the combined dataset with the fiscal year data to create a comprehensive dataset that captures the financial health and default history of each company.

The Y variable will be created based on bankruptcy events within 1 year of each fiscal year taking into account the following criteria : - Chapter 7 and Chapter 11 bankruptcies from LoPucki, - Moody's default date and events that are bankruptcy or Chapter 11 related, - Computstat deletion date and reason (02, 03).

Code
# Merge financial data with the complete bankruptcy data
bankruptcy_data <-fiscal_year %>%
    left_join(comp_lopucki_moody, by = "gvkey") %>%
    mutate(
        # Define target variable Y based on bankruptcy events within 1 year of each fiscal year
        Y = ifelse(
            # Compustat bankruptcy within 1 year after fiscal year
            (!is.na(dldte) & dlrsn %in% c("02", "03") & (fyear + 1 == year(dldte))) |
            # LoPucki bankruptcy within 1 year after fiscal year
            (!is.na(DateFiled) & (fyear + 1 == year(DateFiled))) |
            # Moody's default within 1 year after fiscal year
            (!is.na(default_date) & (grepl("Bankruptcy", default_type, ignore.case = TRUE) | grepl("Chapter 11", default_type, ignore.case = TRUE)) & (fyear + 1 == year(default_date))),      
            1, 
            0
        )
    )

Cleaning Bankruptcy Dataset

In our dataset, some companies went bankrupt at some year. But then, they are still in the dataset for the following years. For these companies, we need to delete the years following the bankruptcy year.

Code
# Identify bankruptcy years for each company
bankrupt_years <- bankruptcy_data %>%
  filter(Y == 1) %>%
  select(gvkey, bankruptcy_fyear = fyear)

# Join bankruptcy years back to the original data
bankruptcy_data_cleaned <- bankruptcy_data %>%
  left_join(bankrupt_years, by = "gvkey") %>%
  # Keep rows where either there is no bankruptcy year or fyear is less than or equal to bankruptcy year
  filter(is.na(bankruptcy_fyear) | fyear <= bankruptcy_fyear) %>%
  select(-bankruptcy_fyear)

We already have well specified the companies that went bankrupt, but we need to clean the dataset to keep only the most relevant “non bankrupt” companies.

To do so, we will look at the reason of deletion for the companies that are not bankrupt and delete the companies with a deletion reason that cannot tells us if they went bankrupt.

For example, a company deleted for the reason “Acquisition or merger” should be keep in the dataset since it is not a bankruptcy reason. But, a company deleted for the reason “Other” should be deleted from the dataset since we cannot know if it went bankrupt or not.

Deletion Reasons used here : - 05: No longer fits file format - 07: Other (no longer files with SEC among other possible reasons) but pricing continues. - 10: Other (no SEC filings, etc)

Code
# Remove rows for companies with no bankruptcies and specific dlrsn values
bankruptcy_data_cleaned <- bankruptcy_data_cleaned %>%
  group_by(gvkey) %>%
  filter(!(sum(Y) == 0 & dlrsn %in% c(05, 10, 07))) %>%
  ungroup()

Extracting target Y from Bankruptcy Data

Final Bankruptcy Dataset:

Code
styled_dt(bankruptcy_data_cleaned)
<<<<<<< HEAD
=======
>>>>>>> main


Summary of the target variable \(Y\) for bankruptcy prediction:

Code
# Select only gvkey and Y columns to use as the target variable dataset
target_Y <- bankruptcy_data_cleaned %>%
    select(gvkey, fyear, Y)

# Count the number of bankrupt companies
n_bankrupt <- sum(target_Y$Y)
n_non_bankrupt <- nrow(target_Y) - n_bankrupt

# Display the number of bankrupt and non-bankrupt companies
cat(
    " Number of bankrupt companies: ", n_bankrupt, "\n",
    "Number of non-bankrupt companies: ", n_non_bankrupt, "\n",
    "Proportion of bankrupt companies: ", round(n_bankrupt / nrow(target_Y), 3), "\n",
    "Proportion of non-bankrupt companies: ", round(n_non_bankrupt / nrow(target_Y), 3), "\n"
)
 Number of bankrupt companies:  1409 
 Number of non-bankrupt companies:  152668 
 Proportion of bankrupt companies:  0.009 
 Proportion of non-bankrupt companies:  0.991 
Code
# Save the bankruptcy dataset and target variable in a .rds file
saveRDS(bankruptcy_data_cleaned, file = "../data/our_data/bankruptcy_data.rds")
saveRDS(target_Y, file = "../data/our_data/target_Y.rds")